The aim of this study is to predict response to anti-TNF therapies in patients with Crohn’s disease. The data was not collected with this purpose in mind, but I think it’s worth checking if the gathered biomarkers have any signal associated with patient response, since this information is available here as well. There is a total of 66 samples in the dataset, with multiple features and 4 ways of categorizing the response (we’ll get into the details later on). Most of the features were measured at 2 time points: right before the therapy administration (baseline) and after it theoretically should take effect (week 14). The patients were given one of the 2 anti-TNF therapies: infliximab or adalimumab. This makes the analysis a bit trickier, because even though the therapies are similar in theory, they are not identical, so it may well turn out that we’ll need to build separate models for each of them (ofc I’m writing this intro somewhat retrospectively so - spoiler alert - we will).
There is a total of 66 samples with 184 features. 489 values are missing. First, let’s see how many features have missing values as well as the top 15 features with the most missing values.
## Features with missing values: 56
## Other 56
## SES-CD after 46
## SES-CD before 31
## VEGF-C before, pg/ml 27
## sIL-6R before, ng/ml 27
## sIL-6R after, ng/ml 27
## VEGF-C after, pg/ml 27
## hsTNF after, pg/ml 22
## hsTNF before, pg/ml 22
## Serpin A12 (Vaspin) before, pg/ml 12
## Serpin A12 (Vaspin) after, pg/ml 12
## IL-17 after, pg/ml 9
## CCL20 before, pg/ml 9
## CCL20 after, pg/ml 9
## MTT before (%SFM) 9
## dtype: int64
As we can see, many features have some missing values and some of those features are missing most of the values.🙃 Let’s do an analogous check for the samples.
## Samples with missing values: 64
## No
## 49 30
## 56 26
## 38 20
## 50 19
## 48 19
## 60 16
## 62 16
## 40 15
## 31 13
## 52 13
## 37 13
## 1 13
## 51 11
## 33 11
## 11 11
## dtype: int64
Well, almost all samples have at least one missing value + even though one of the samples has 30 missing values, it’s still not that bad. Given what we know so far, let’s deal with it this way. Since we only have 66 samples, each one is priceless, so we’ll start by dropping the features with the most missing values, say - more than 10. According to the previous table, we should drop 11 features and this leaves us with 173.
Now, let’s check how this affected the samples.
## Samples with missing values: 27
Ok, that’s waaay better.
After this rather straightforward move, let’s see what we can do about the rest of the missing data.
Let’s visualize the remaining missing values.
Even though we can clearly see some patterns here, let’s leave it at that for now and we’ll handle the rest of the missing values later on with some imputing method 🚩.
Most of the features come in 3 flavors: baseline, week 14 and delta (either absolute or % change). First of all, let’s remove the duplicated ‘Drug (IFX-1, ADA-2)’ feature. Next, let’s get rid of the absolute delta features and the after therapy features, since we don’t want to use them for prediction.
Now we’re left with 66 features, several of which are not strictly numerical. The plan is as follows:
Let’s do the necessary transformations, set the correct types for each feature, and plot their distributions.
We end up with 66 samples and 69 features distributed as follows.
Clearly, some features are very skewed, but we’ll leave them as they are for now 🚩 and look at the correlations between them.
OK!
There are a few highly correlated features, but not nearly as many as I expected!
Looking at individual correlations, we can see that the most correlated features are:
and after that we’re getting below 0.9 territory. Hemoglibin and hematocrit are obviously correlated, as are MCV and MCH, although they are not 100% reduntant. The rest of these correlations form a kind of correlation triangle, where each feature is correlated with the other two, but I don’t feel competent enough yet to say if any of them can be removed. I’ll leave them as they are for now 🚩.
According to the provided data, there are 4 different ways of defining if a given patient responded to the therapy:
Now, I looked up CDAI and I have some thoughts about it.
Namely…
I really don’t like it 🙃.
It’s very subjective and what’s worse - it’s not an actual number! Using this score to predict the response is probably not going to work - especially given the limited amount of data we have. I’d feel much more comfortable using some “harder” indicator like calprotectin or lactoferin as a response definition, but I guess we’ll have to work with what we have. I’ll add normal CRP as an additional response definition - just out of curiocity - although I realize it’s probably not a good idea.
We can also try to predict the actual value of CDAI after therapy, as well as the difference between the values before and after therapy.
Let’s start the statistical analysis with some simple tests to see if there are any significant differences between the groups. To do this, we’ll need to handle the missing values first. I don’t want to do it permanently here, because I’ll need the missing values for the machine learning part, so I’ll do it just for the statistical analysis by filling them with the mean value of the feature.
Let’s also pull one more trick out of our sleeve and add some random features to the data. This will allow us to see if the features we have are actually better than random at predicting the response. Let’s add 10 random features to the data and see what happens.
Now we’re ready to start testing for significant differences in feature values between responders and non-responders, according to our 5 response definitions. We perform a series of statistical tests comparing all features within these groups. Each feature is first tested for normality within each group using a Shapiro-Wilk test and for equality of variances using Levene test. If these conditions were met - we perform a t-test, otherwise - we use a non-parametric Mann-Whitney U test. Regardless of the test, the null hypothesis is that there are no differences between the groups. I’ll show only the results below \(\alpha=0.05\).
## Test results for Response CDAI:
## Group counts: 18 | 48
## - Localization colon: [Mann-Whitney U test] p-value = 0.0205; mean difference = 0.2778
## - Age: [Mann-Whitney U test] p-value = 0.0259; mean difference = -3.9375
## Test results for Response CDAI + CRP:
## Group counts: 32 | 34
## - CRP: [Mann-Whitney U test] p-value = 0.007; mean difference = -12.3513
## - Age: [Mann-Whitney U test] p-value = 0.0141; mean difference = -3.557
## - _____RANDOM_7_____: [Mann-Whitney U test] p-value = 0.0235; mean difference = 0.1687
## - Mono: [Mann-Whitney U test] p-value = 0.0336; mean difference = -0.1515
## Test results for Remission CDAI:
## Group counts: 24 | 42
## - CDAI: [Mann-Whitney U test] p-value = 0.0011; mean difference = -72.4345
## - Previous anti-TNF therapy: [Mann-Whitney U test] p-value = 0.0318; mean difference = -0.2619
## - Male: [Mann-Whitney U test] p-value = 0.0377; mean difference = 0.2679
## Test results for Remission CDAI + CRP:
## Group counts: 33 | 33
## - CDAI: [t-test] p-value = 0.0008; mean difference = -75.0606
## - CRP: [Mann-Whitney U test] p-value = 0.0161; mean difference = -11.3182
## - VEGF-B-Ab: [Mann-Whitney U test] p-value = 0.0433; mean difference = 7.5333
## - Age: [Mann-Whitney U test] p-value = 0.0472; mean difference = -3.0303
## - VEGFR1-Ab: [Mann-Whitney U test] p-value = 0.049; mean difference = 7.7212
## Test results for Response CRP:
## Group counts: 23 | 43
## - CRP: [Mann-Whitney U test] p-value = 0.0001; mean difference = -19.2088
## - Resistin: [Mann-Whitney U test] p-value = 0.0077; mean difference = -5.9911
## - CDAI: [t-test] p-value = 0.0082; mean difference = -62.9838
## - Mono: [Mann-Whitney U test] p-value = 0.0105; mean difference = -0.1982
## - Fibrinogen: [Mann-Whitney U test] p-value = 0.0132; mean difference = -79.0744
## - VEGFR1-Ab: [Mann-Whitney U test] p-value = 0.0246; mean difference = 6.5513
## - Steroids: [Mann-Whitney U test] p-value = 0.0451; mean difference = -0.2599
Several observations here. First of all, vast majority of the performed test are Mann-Whitney U tests, which means that the data is not normally distributed. Not surprising, given the skewness of the features we’ve seen earlier + small groups. Secondly, there are very few significant differences between the groups. CRP and CDAI before treatment are not surprising, because their values after the therapy are used to define the response. We can also see some random features popping up, which I find a bit unsettling, but that’s exactly why I added them in the first place - as a reminder of how thin the ice we’re standing on is.
There’s also one more catch.
We’re performing a gazillion tests here, so we should probably adjust the p-values for multiple comparisons, but judging by the results, we can already see that it’s simply going to leave us with no significant differences at all.
And I think this is actually the correct outcome here.
I suspect this is due to a couple of reasons:
So let’s try the statistical analysis again, but this time we’ll split the data into two groups based on the treatment and perform the tests separately for each group.
## ----- Infliximab -----
## Test results for Response CDAI:
## Group counts: 6 | 26
## - _____RANDOM_7_____: [t-test] p-value = 0.0258; mean difference = 0.2859
## - _____RANDOM_2_____: [Mann-Whitney U test] p-value = 0.0289; mean difference = -0.2673
## - 5-ASA: [Mann-Whitney U test] p-value = 0.0315; mean difference = 0.2949
## - Form perinanal: [Mann-Whitney U test] p-value = 0.0352; mean difference = -0.3846
## - Age: [Mann-Whitney U test] p-value = 0.0397; mean difference = -6.9231
## - Ang-2: [t-test] p-value = 0.0448; mean difference = 615.4107
## Test results for Response CDAI + CRP:
## Group counts: 15 | 17
## - CRP: [Mann-Whitney U test] p-value = 0.0182; mean difference = -18.16
## - BMI: [Mann-Whitney U test] p-value = 0.0328; mean difference = -2.9573
## - Previous anti-TNF therapy: [Mann-Whitney U test] p-value = 0.039; mean difference = -0.3569
## Test results for Remission CDAI:
## Group counts: 12 | 20
## - Previous anti-TNF therapy: [Mann-Whitney U test] p-value = 0.0036; mean difference = -0.5167
## - Height: [t-test] p-value = 0.0054; mean difference = 8.1218
## - CDAI: [t-test] p-value = 0.0429; mean difference = -78.2167
## - Age: [Mann-Whitney U test] p-value = 0.0488; mean difference = -5.5333
## - Male: [Mann-Whitney U test] p-value = 0.0489; mean difference = 0.3667
## Test results for Remission CDAI + CRP:
## Group counts: 15 | 17
## - Previous anti-TNF therapy: [Mann-Whitney U test] p-value = 0.0051; mean difference = -0.4824
## - CDAI: [t-test] p-value = 0.0261; mean difference = -82.7961
## - Height: [t-test] p-value = 0.0391; mean difference = 6.01
## - CRP: [Mann-Whitney U test] p-value = 0.0395; mean difference = -17.5451
## Test results for Response CRP:
## Group counts: 12 | 20
## - CRP: [Mann-Whitney U test] p-value = 0.0031; mean difference = -24.18
## - Steroids: [Mann-Whitney U test] p-value = 0.0071; mean difference = -0.5
## - Fibrinogen: [Mann-Whitney U test] p-value = 0.0291; mean difference = -103.3801
## - Probiotics: [Mann-Whitney U test] p-value = 0.0314; mean difference = -0.3833
## - Previous anti-TNF therapy: [Mann-Whitney U test] p-value = 0.0314; mean difference = -0.3833
## - CDAI: [t-test] p-value = 0.0382; mean difference = -79.95
## ----- Adalimumab -----
## Test results for Response CDAI:
## Group counts: 12 | 22
## - VEGF-B-Ab: [Mann-Whitney U test] p-value = 0.021; mean difference = 9.0424
## - Mono: [Mann-Whitney U test] p-value = 0.0269; mean difference = -0.2364
## - VEGF-A-Ab: [Mann-Whitney U test] p-value = 0.035; mean difference = 11.8174
## - _____RANDOM_4_____: [t-test] p-value = 0.0397; mean difference = -0.1775
## - MCP-1: [t-test] p-value = 0.0399; mean difference = -76.0964
## Test results for Response CDAI + CRP:
## Group counts: 17 | 17
## - Mono: [Mann-Whitney U test] p-value = 0.0202; mean difference = -0.2118
## - _____RANDOM_7_____: [t-test] p-value = 0.0212; mean difference = 0.2223
## - Age: [Mann-Whitney U test] p-value = 0.031; mean difference = -4.1176
## Test results for Remission CDAI:
## Group counts: 12 | 22
## - VEGF-B-Ab: [Mann-Whitney U test] p-value = 0.0065; mean difference = 9.3129
## - _____RANDOM_9_____: [Mann-Whitney U test] p-value = 0.009; mean difference = -0.2348
## - CDAI: [t-test] p-value = 0.0145; mean difference = -65.0379
## - VEGF-A-Ab: [Mann-Whitney U test] p-value = 0.0267; mean difference = 11.7917
## - Probiotics: [Mann-Whitney U test] p-value = 0.0365; mean difference = 0.3788
## Test results for Remission CDAI + CRP:
## Group counts: 18 | 16
## - CDAI: [t-test] p-value = 0.0034; mean difference = -73.0764
## - VEGF-B-Ab: [Mann-Whitney U test] p-value = 0.0136; mean difference = 10.3792
## - _____RANDOM_9_____: [Mann-Whitney U test] p-value = 0.0285; mean difference = -0.1692
## Test results for Response CRP:
## Group counts: 11 | 23
## - CRP: [Mann-Whitney U test] p-value = 0.0071; mean difference = -14.0115
## - Mono: [Mann-Whitney U test] p-value = 0.0216; mean difference = -0.1992
## - Resistin: [Mann-Whitney U test] p-value = 0.0429; mean difference = -6.8588
Now we’ve obviously added another ton of tests, so the results are even less significant, but I think we can see at least some reasonable things here. The fact that patients who had previous anti-TNF therapy are less likely to respond probably makes sense - those are usually the harder cases. Still a poor biomarker though… Asamax is used almost in every patient, so the significance of this result probably is down to a single patient. The fact that we see stuff like height and random features popping up again means that we’re still threading on thin ice here.
I’m curious whether there is a difference in the response between the two treatments. Let’s test that.
## Test results for ADA:
## Group counts: 32 | 34
## No significant differences found
## Test results for ADA:
## Group counts: 32 | 34
## No significant differences found
## Test results for ADA:
## Group counts: 32 | 34
## No significant differences found
## Test results for ADA:
## Group counts: 32 | 34
## No significant differences found
## Test results for ADA:
## Group counts: 32 | 34
## No significant differences found
We see no difference in the response between the two treatments, regardless of the response definition. This ofc doesn’t mean the groups are identical. Just that the rate of response is +/- the same.
Before we move on to the machine learning part, let’s try to look for correlations between the features and CDAI after treatment, the difference in CDAI before and after treatment, and CRP after treatment.
## ----- CDAI -----
## CDAI 0.448339
## IL-6 0.401476
## CRP 0.377670
## MTT 0.311578
## Ferrum 0.271617
## dtype: float64
## Male -0.225733
## VEGFR1-Ab -0.227144
## VEGF-A-Ab -0.236654
## Albumin -0.259265
## _____RANDOM_7_____ -0.268643
## dtype: float64
## ----- CDAI diff -----
## Age 0.264793
## Leptin 0.260375
## Form perinanal 0.184829
## BMI 0.184550
## ADA 0.176960
## dtype: float64
## Ang-2 -0.210232
## Eo -0.224691
## Male -0.248157
## Plt -0.259803
## CDAI -0.435921
## dtype: float64
## ----- CRP -----
## CRP 0.679205
## IL-6 0.403810
## Mono 0.353074
## Fibrinogen 0.332144
## Resistin 0.311421
## dtype: float64
## Fe -0.212601
## Previous surgery -0.217890
## Albumin -0.221739
## 5-ASA -0.225511
## TIBC -0.308417
## dtype: float64
Ok - it’s not surprising, that CDAI before treatment is correlated with CDAI after treatment. Similarly - CRP before treatment is correlated with CRP after treatment. We’ll need to tak this into account when building the models.
Now let’s move on to the machine learning part.
Let’s establish a baseline model. I’ll start with random forest, because it should show us if there’s any signal in the data at all, without any hyperparameter tuning. We just need to add missing value imputation before the model. We’ll use stratified 5-fold cross-validation repeated 4 times with auc roc to evaluate the model and plot the distribution of the scores for each response definition.
Now let’s build the models on the whole dataset with some additional random features and look at the feature importances. We’ll just display the features with higher importance than the best random feature.
Well - this doesn’t look good.
We have random features popping up all over the place.
However, we also see some obvious features like CRP and CDAI before treatment, so I’d get rid of those two and try again.
Well, this looks really bad…
Only the model predicting remission is better than random, but looking at feature importance - it relies on random features 🙃
Of course the model predicting CRP also still looks good, but we don’t know with what else is CRP correlated + it’s probably not a valid response definition.
I would be great if we could get our hands on some calprotectin or lactoferrin!
We need to regroup, but for now, let’s also try predicting the CDAI score after treatment as well as the difference between CDAI before and after treatment. I would also add another definition of difference, which is the difference between CDAI before and after treatment divided by CDAI before treatment, i.e., the relative change. We’ll also add some random features to the dataset.